8 research outputs found

    Exploring query execution strategies for JIT vectorization and SIMD

    This paper partially explores the design space for efficient query processors on future hardware that is rich in SIMD capabilities. It departs from two well-known approaches: (1) interpreted block-at-a-time execution (a.k.a. "vectorization") and (2) "data-centric" JIT compilation, as in the HyPer system. We argue that in between these two design points in terms of granularity of execution and uni…
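
    The block-at-a-time "vectorization" approach contrasted above can be sketched minimally as follows. This is an illustrative sketch, not code from the paper; the names (`VECTOR_SIZE`, `map_mul`, `map_add`, `vectorized_eval`) are assumptions for exposition.

    ```python
    # Sketch: block-at-a-time ("vectorized") interpretation.
    # Interpretation overhead is paid once per vector of tuples instead of
    # once per tuple, and each primitive is a tight, SIMD-friendly loop.

    VECTOR_SIZE = 1024  # typical vector length, chosen to fit in cache

    def map_mul(out, a, b, n):
        # primitive: multiply two attribute vectors element-wise
        for i in range(n):
            out[i] = a[i] * b[i]

    def map_add(out, a, b, n):
        # primitive: add two attribute vectors element-wise
        for i in range(n):
            out[i] = a[i] + b[i]

    def vectorized_eval(col_a, col_b, col_c):
        """Evaluate a*b + c one vector at a time."""
        result = []
        tmp = [0] * VECTOR_SIZE
        out = [0] * VECTOR_SIZE
        for start in range(0, len(col_a), VECTOR_SIZE):
            n = min(VECTOR_SIZE, len(col_a) - start)
            a = col_a[start:start + n]
            b = col_b[start:start + n]
            c = col_c[start:start + n]
            map_mul(tmp, a, b, n)     # one interpreted call per vector...
            map_add(out, tmp, c, n)   # ...not per tuple
            result.extend(out[:n])
        return result
    ```

    A data-centric JIT compiler would instead fuse both primitives into a single generated loop per query; the paper's point is that intermediate granularities between these two extremes are worth exploring.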

    Charting the design space of query execution using VOILA

    Database architecture, while having been studied for four decades now, has delivered only a few designs with well-understood properties. These few are followed by most actual systems. Acquiring more knowledge about the design space is a very time-consuming process that requires manually crafting prototypes, with a low chance of generating material insight. We propose a framework that aims to accelerat…

    Highlighting the performance diversity of analytical queries using VOILA

    Hardware architecture has long influenced software architecture, and notably so in analytical database systems. Currently, we see a new trend emerging: A "tectonic shift" away from X86-based platforms. Little is (yet) known on how this shift affects database system performance and, consequently, should influence the design choices made. In this paper, we investigate the performance characteristics of X86, POWER, ARM and RISC-V hardware on micro- as well as macro-benchmarks on a variety of analytical database engine designs. Our tool to do so is VOILA: a new database engine generator framework that from a single specification can generate hundreds of different database architecture engines (called "flavors"), among which well-known design points such as vectorized and data-centric execution. We found that performance on different queries by different flavors varies significantly, with no single best flavor overall, and per query different flavors winning, depending on the hardware. We think this "performance diversity" motivates a redesign of existing – inflexible – engines towards hardware- and query-adaptive ones. Additionally, we found that modern ARM platforms can beat X86 in terms of overall performance by up to 2×, provide up to 11.6× lower cost per instance, and up to 4.4× lower cost per query run. This is an early indication that the best days of X86 are over.

    Efficient query processing with Optimistically Compressed Hash Tables & Strings in the USSR

    Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint is often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation: Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width. Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently-accessed and infrequently-accessed value slices. By removing the infrequently-accessed value slices from the hash table record, it improves cache locality. The Unique Strings Self-aligned Region (USSR) accelerates handling frequently-occurring strings, which are very common in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure. We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2–4× and improves performance by up to 1.5×. On a real-world BI workload, we measured a 2× improvement in performance and in micro-benchmarks we observed speedups of up to 25×.
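
    The bit-packing idea behind Domain-Guided Prefix Suppression can be sketched as below. This is a hedged illustration, not the Vectorwise implementation; the helper names (`bits_needed`, `pack`, `unpack`) are assumptions. When the optimizer knows the domain of a key (say, 0–9999), only 14 bits are needed for it, so key and value can share one machine word instead of two, halving the hash-table record width.

    ```python
    # Sketch: pack a key and a value into a single word, given the
    # number of bits the key's domain requires.

    def bits_needed(domain_max):
        # bits required to represent any value in [0, domain_max]
        return max(1, domain_max.bit_length())

    def pack(key, value, key_bits):
        # key occupies the low bits, value the bits above it
        return (value << key_bits) | key

    def unpack(word, key_bits):
        key = word & ((1 << key_bits) - 1)
        value = word >> key_bits
        return key, value
    ```

    For example, a key in 0–9999 needs 14 bits, leaving 50 bits of a 64-bit word for the value; narrower records mean more hash-table entries per cache line.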

    Optimistically compressed Hash Tables & Strings in the USSR

    Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint is often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation: Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width. Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently- and infrequently-accessed value slices. By removing the infrequently-accessed value slices from the hash table record, it improves cache locality. The Unique Strings Self-aligned Region (USSR) accelerates handling frequently occurring strings, which are widespread in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure. We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2–4× and improves performance by up to 1.5×. On a real-world BI workload, we measured a 2× improvement in performance and in micro-benchmarks we observed speedups of up to 25×.
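
    The USSR's on-the-fly dictionary idea can be sketched roughly as follows. The class name and interface here are assumptions for illustration, not the Vectorwise API: frequent strings are interned into a bounded region, after which string equality reduces to comparing small integer codes.

    ```python
    # Sketch: a bounded on-the-fly dictionary of frequent strings.
    # Interned strings get stable integer codes, so equality checks and
    # hashing on them become cheap integer operations.

    class StringRegion:
        def __init__(self, capacity=8):
            self.codes = {}      # string -> integer code
            self.strings = []    # integer code -> string
            self.capacity = capacity

        def intern(self, s):
            """Return the integer code for s, or None if the region is
            full and s is not already present (caller falls back to
            ordinary string handling)."""
            code = self.codes.get(s)
            if code is None:
                if len(self.strings) >= self.capacity:
                    return None
                code = len(self.strings)
                self.codes[s] = code
                self.strings.append(s)
            return code
    ```

    Two strings with the same code are equal by construction, so a group-by on a low-cardinality string column can compare codes instead of byte sequences.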

    Optimizing group-by and aggregation using GPU-CPU co-processing

    While GPU query processing is a well-studied area, real adoption is limited in practice, as GPU execution is typically only significantly faster than CPU execution if the data resides in GPU memory, which limits scalability to small-data scenarios where performance tends to be less critical. Another problem is that not all query code (e.g. UDFs) will realistically be able to run on GPUs. We therefore investigate CPU-GPU co-processing, where both the CPU and GPU are involved in evaluating the query in scenarios where the data does not fit in GPU memory. As we wish to deeply explore opportunities for optimizing execution speed, we narrow our focus further to a specific well-studied OLAP scenario amenable to such co-processing, in the form of the TPC-H benchmark Query 1. For this query, and at large scale factors, we are able to improve performance significantly over the state-of-the-art for GPU implementations; we present competitive performance of a GPU versus a state-of-the-art multi-core CPU baseline, a novelty for data exceeding GPU memory size; and finally, we show that co-processing does provide significant additional speedup over either of the processors individually. We achieve this performance improvement by utilizing parallelism-friendly compression to alleviate the PCIe transfer bottleneck, query-compilation-like fusion of the processing operations, and a simple yet effective scheduling mechanism. We hope that some of these features can inspire future work on GPU-focused and heterogeneous analytic DBMSes.
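
    One parallelism-friendly compression scheme of the kind the abstract alludes to is frame-of-reference encoding, sketched below. The specific scheme and names (`for_encode`, `for_decode`) are assumptions, not necessarily what the paper uses: each block stores a base value plus narrow per-value deltas, so less data crosses the PCIe bus and every block can be decoded independently, in parallel, on the GPU.

    ```python
    # Sketch: frame-of-reference compression for one block of integers.
    # Deltas from the block minimum need far fewer bits than raw values,
    # and blocks are self-contained, so decoding parallelizes trivially.

    def for_encode(block):
        base = min(block)
        deltas = [v - base for v in block]
        width = max(d.bit_length() for d in deltas) or 1  # bits per delta
        return base, width, deltas

    def for_decode(base, width, deltas):
        # width would drive the bit-unpacking kernel on a real GPU;
        # here we just reverse the delta encoding
        return [base + d for d in deltas]
    ```

    For a block of order keys clustered around a large base, the deltas fit in a few bits each, cutting transfer volume by an order of magnitude before the fused processing kernels ever run.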

    VOILA
